Identification of potential therapeutic intervening targets by in-silico analysis of nsSNPs in preterm birth-related genes

Prematurity is the foremost cause of death in children under 5 years of age. Genetics contributes to 25–40% of all preterm births (PTB) yet we still need to identify specific targets for intervention based on genetic pathways. This study involved the effect of region-specific non-synonymous variations and their transcript level mutational impact on protein functioning and stability by various in-silico tools. This investigation identifies potential therapeutic targets to manage the challenge of PTB, corresponding protein cavities and explores their binding interactions with intervening compounds. We searched 20 genes coding 55 PTB proteins from NCBI. Single Nucleotide Polymorphisms (SNPs) of concerned genes were extracted from ENSEMBL, and filtration of exonic variants (non-synonymous) was performed. Several in-silico downstream protein functional effect prediction tools were used to identify damaging variants. Rare coding variants were selected with an allele frequency of ≤1% in 1KGD, further supported by South Asian ALFA frequencies and GTEx gene/tissue expression database. CNN1, COL24A1, IQGAP2 and SLIT2 were identified with 7 rare pathogenic variants found in 17 transcript sequences. The functional impact analyses of rs532147352 (R>H) of CNN1 computed through PhD-SNP, PROVEAN, SNP&GO, PMut and MutPred2 algorithms showed impending deleterious effects, and the presence of this pathogenic mutation in CNN1 resulted in large decrease in protein structural stability (ΔΔG (kcal/mol). After structural protein identification, homology modelling of CNN1, which has been previously reported as a biomarker for the prediction of PTB, was performed, followed by the stereochemical quality checks of the 3D model. Blind docking approach were used to search the binding cavities and molecular interactions with progesterone, ranked with energetic estimations. Molecular interactions of CNN1 with progesterone were investigated through LigPlot 2D. Further, molecular docking experimentation of CNN1 showed the significant interactions at S102, L105, A106, K123, Y124 with five selected PTB-drugs, Allylestrenol (-7.56 kcal/mol), Hydroxyprogesterone caproate (-8.19 kcal/mol), Retosiban (-9.43 kcal/mol), Ritodrine (-7.39 kcal/mol) and Terbutaline (-6.87 kcal/mol). Calponin-1 gene and its molecular interaction analysis could serve as an intervention target for the prevention of PTB.


Introduction
Polymorphism Phenotyping (PolyPhen2). The functional impact of substitution of an amino acid on collective proteins function and its stability was scored through PolyPhen2 (Polymorphism Phenotyping, http://genetics.bwh.harvard.edu/pph2/dbsearch.shtml). The prediction (score) output is categorized as 'most likely damaging', 'may be damaging', and 'benign' with specificity and sensitivity values for a mutation [18] and was used as for gene annotation.
Combined Annotation Dependent Depletion (CADD) score. CADD is preferably used to filter the deleteriousness of SNVs as insertion/deletion variants in various genomic dataset of humans. CADD score with 20 or greater was chosen to select nsSNPs because this scale strongly correlates with allelic diversity, variant pathogenicity and experimentally measured regulatory effects within individual genome sequences [19].
The Rare Exome Variant Ensemble Learner (REVEL). To assess the potential pathogenicity of SNV, REVEL is used to assess the pathogenicity of missense variants. This tool is not used as a descriptive prediction method but for ease, it displays scores greater '0.5', as 'likely disease causing' and display scores lower to '0.5' as 'likely benign'. It was estimated that 75.4% of disease mutations but only 10.9% of neutral variants have a score above 0.5 [20]. MetaLR. This logistic regression-based tool is used to categorized SNV as 'damaging' or 'tolerated'. In this case, a score between 0 and 1 is also provided and variants with higher scores are more likely to be deleterious [21].
Mutation assessor class. This tool predicts the functional impact of amino acid substitutions. The prediction is classified as 'high' that represented any variants with more likely to be deleterious in genomic datasets [22].

Identification of regional nsSNPs by allele frequency aggregator (ALFA) of South Asian populations
Allele Frequency Aggregator (ALFA) is one of the largest and most comprehensive aggregated variant datasets of allele frequency with open access available from the NCBI database of Genotypes and Phenotypes (dbGaP) [23,24]. ALFA computes the allele frequency for variant in dbGaP across approved un-restricted studies and provides the data as open access through dbSNP [25]. With the help of ALFA, the total sample size with reference to alternate allele of filtered/identified variants both globally and in South Asian populations were identified and listed.

Transcript level analysis of the functional impacts of nsSNPs using different in-silico algorithms
All exonic missense SNPs, retrieved and analyzed through earlier mentioned strategies were further investigated by the transcript sequence-based computational analysis through various Machine Learning (ML) algorithms, to further scrutinized the nsSNPs as whether pathogenic or not. These in silico approaches include: PANTHER-PSEP. PANTHER-PSEP was used to calculate the score of position-specific evolutionary preservations through online classification system, such as, http://www. pantherdb.org/tools/csnpScore.do. It measures the length of time (in millions of years 'my'), a position in a current protein has been preserved by tracing back to its reconstructed direct ancestors. The longer a position has been preserved, the more likely that it will have a deleterious effect. The thresholds were: "probably damaging" (time > 450my, corresponding to a false positive rate of~0.2 as tested on HumVar), "possibly damaging" (450my > time > 200my, corresponding to a false positive rate of~0.4) and "probably benign" (time < 200my) [26].
PhD-SNP. PhD-SNP is a pathogenicity-predicting tool based on support vector machines through web server (https://snps.biofold.org/phd-snp/phd-snp.html). The Swiss-Prot database was used for training and the prediction was used with 20-fold cross validation [27].
PROVEAN. PROVEAN is sequence homology-based tools which have independently developed algorithms through web server available on http://provean.jcvi.org/index.php. The cutoff score of PROVEAN is -2.5 and mutations with scores over -2.5 are predicted to be pathogenic [28].
SNPs&GO. SNPs&GO from a protein sequence predict whether a variation is disease related or not by exploiting the corresponding protein functional annotation available on https://snps.biofold.org/snps-and-go/snps-and-go.html. SNPs&GO collects in unique framework information derived from protein sequence, evolutionary information, and function as encoded in the Gene Ontology (GO) terms, and outperforms other available predictive methods. SNPs&GO includes information extracted from the protein 3D structure (SNPs&GO3d) with respect to sequence-based method [29].
SNAP2. SNAP2 is a neural network-based classifier. SNAP2 predicts the impact (effect) of single amino acid substitutions on protein function (https://rostlab.org/services/snap2web/) that distinguishes between effect and neutral variants/nsSNPs by taking a variety of sequence and variant features into account. A high score (>50 indicates strong signal for effect), white indicates weak signals (-50<score<50), and a low score (score<-50, strong signal for neutral / no effect [30]. PMut. PMut (http://mmb.irbbarcelona.org/PMut/) is a web-based tool for the annotation of pathological variants on proteins, was trained and tested by the manually created database SwissVar (October 2016 release), which includes 27,203 harmful and 38,078 benign mutations for 12,141 proteins. The prediction scores of PMut are from 0 to 1, and the cutoff value is set to 0.5 (neutral, 0 to 0.5; pathological, 0.5 to 1) [31].
MutPred2. A web-server tool developed to categorize and predict amino acid substitutions on functional protein as pathogenic or benign in human, available through http:// mutpred.mutdb.org/. The scores range from 0.5 to 1.0 are classify as 'pathogenic mutation' whereas, values less than 0.5 are 'benign' [32].
In-silico supervised learning-based approaches to characterize nsSNPs impact on protein stability and evolutionary conservation associated with structural properties of proteins MUpro. MUpro is a set of ML programs to predict how single-site amino acid mutation affects protein stability (http://mupro.proteomics.ics.uci.edu/). It was developed on the basis of two machine learning methods i.e., Support Vector Machines (SVM) and Neural Networks (NN). The advantage of this in-silico algorithm is that they do not require tertiary structures to predict protein stability changes [33].
I-Mutant. I-Mutant v.3.0 is a predictor of protein stability changes upon mutations. I-Mutant v.3.0 is an SVM-based tool for the automatic prediction of protein stability changes upon single point mutations at pH 7 and temperature equal to 25˚C. The predictions were performed from the protein sequence and can be used both as a classifier for predicting the sign of the protein stability change upon mutation and as a regression estimator for predicting the related DeltaDeltaG (ΔΔG) values (http://gpcr2.biocomp.unibo.it/~emidio/I-Mutant3.0/ I-MutantSuite_Help.html). Through I-Mutant predictions, the ternary classificatory has an extra class, namely Neutral, that refers to small ΔΔG values changes upon (WT/New) single point protein mutations. The range of ΔΔG values for a Neutral mutation classification is: -0.5 = <DDG = <0.5, while changes <-0.5 are classified as Large Decrease and changes >0.5 are classified as Large Increase [34].
iPTREE-STAB. iPTREE-STAB/iStable v.2.0 is an interpretable decision tree-based method for discriminating the stability of proteins stabilizing or destabilizing (predicting their stability changes, ΔΔG) upon single amino acid substitutions from amino acid sequence. The discrimination and predictions are mainly based on decision tree coupled with adaptive boosting algorithm, and classification and regression tree, respectively, using three neighboring residues of the mutant site along N-and C-terminals (http://ncblab.nchu.edu.tw/iStable2/ seqsubmit.html). This tool also computes the secondary structural information and surface accessibility of protein sequence with mutation [35,36].
INPS-MD. INPS-MD (Impact of Non-synonymous mutations on Protein Stability-Multi Dimension, (https://inpsmd.biocomp.unibo.it) is a method used to predict stability of protein variants from sequences and structures. The INPS-MD predictor using sequences is based on a simplified support vector (SVR) as implemented by the libsvm package, which was only tested by linear and radial basis function (RBF) kernels. INPS-MD predictions can be interpreted to identify stabilizing (ΔΔG>0) and destabilizing (ΔΔG<0) variations [37].
ConSurf. ConSurf software (http://consurf.tau.ac.il) was used to analyze the evolutionary conservation of amino acids by calculating the conservation score through unique algorithm.
The algorithm calculates normalized conservative score using Bayesian method for calculating rates with a confidence interval to each of the inferred evolutionary conservation scores. The amino acids with scores between 7 and 9 were evolutionary conservative amino acids [38].
HOPE. The HOPE software (Have (y) Our Protein Explained, http://www.cmbi.ru.nl/ hope/input/) was used to predict the effect of amino acid mutation/SNPs on physical and chemical properties, hydrophobicity, spatial structure, and function of proteins [40].

Exploring protein structural coverage expressed in female reproductive tissues
The Genotype-Tissue Expression (GTEx) database cataloged a large number of tissue-specific genotypes and shared regulatory expression quantitative trait loci (eQTL) variants. We used GTEx to identify the best expressed gene in the female reproductive tissues (bladder, cervixectocervix, endocerivx, fallopian tubes, ovaries, uterus, and vagina) that could impact PTB status at protein level [41]. After investigating the concerned genes/proteins with identified tissue-specific regulatory expression in female reproductive tissues, the Protein Data Bank (PDB) structural coverage details from previous reported studies were investigated from PDB [42].

Homology modeling and 3D structural assessment of highly expressed PTB protein
Protein Homology/analogY Recognition Engine V 2.0 (Phyre2) tool was used for homology modeling of concerned PTB protein [43]. Following two attributes in protein selection for homology modeling were also kept considered for inclusion: 1. Protein highly expressed with a large number of tissue-specific and shared regulatory variants in female reproductive tissues.
2. Protein with highest structural coverage information (retrieved through PDB) among others (found with rare coding variants).
After following the above-mentioned inclusion criteria, CNN1 was selected for 3D homology modeling. The target sequence of Calponin1 (CNN1 Isoform 2) protein was searched from UniProt database (Entry No. P51911-2). PHYRE2 tool was used for homology modeling that consisted of sub-algorithmic stages: Stage 1-Gathering Homologous Sequences: CNN1 isoform 2 sequence was scanned against the specially curated NR20 (No of sequences with >20% mutual sequence identity) protein sequence database with HHblits [44]. The resulting Multiple Sequence Alignment (MSA) was used to predict the secondary structure with PSI-blast based secondary structure PREDiction (PSIPRED) [45], and both the alignment and secondary structure prediction combined into a query Hidden Markov Model (HMM).

Stage 2-Fold Library Scanning:
The models were scanned against a database of HMMs [46] of proteins of known structure. The top-scoring alignments from this search were used to construct crude backbone-only models.
Stage 3-Loop Modeling: Indels in these models were corrected by loop modeling. Stage 4-Side-chain Placement: Amino acid side chains were added to generate the final PHYRE2 models.
The best homology models were selected based on the top-ranked modeled structures (according to similar template pattern). The alignment description of templates obtained through manual protein-blast comparison with the highest percent identity at 100% confidence was generated by Phyre2 tool. The quality assessment of stereochemical properties of 3D homology models were carried out by PROCHECK [47]. Ramachandran plot and residual properties of constructed 3D model along with the dihedral angles of φ against ψ of all possible conformations of amino acids in protein structure had also been studied in Ramachandran plot [48]. The validation of 3D protein models were also performed by SAVES (Structure Validation Server: https://saves.mbi.ucla.edu/) web server [49] to know the probable structural errors and z-score of the chosen model. SMART-EMBL tool (http://smart.embl-heidelberg.de/ ) was used to compute the confidently predicted domains, repeats, motifs and low complexity region (LCR) in protein. The SuperPose webserver (http://superpose.wishartlab.com/) was used to calculate both sequence alignment between template and 3D homology model through structure superposition using modified quaternion eigenvalue approach to generate RMSD statistics of the superimposed molecules [50].

Validation of binding cavities by consensus mode
To detect protein binding cavities, cavity-detection guided Blind Docking approach was used to explore the binding cavities of homology models. CB-Dock is an automatic protein-ligand docking approach which identifies the binding cavities/sites. CB dock compared cavities and ranked through a method called 'CurPocket' with state-of-the-art protein-ligand binding site prediction methods using the benchmark set of COACH [51], as prediction methods. CB dock also calculates the center and size of docking box of putative cavity as a key parameter of process with novel curvature-based cavity detection approach. This method was carefully optimized and achieved~70% success rate for the top-ranking poses whose root mean square deviation (RMSD) were within 2 Å from the X-ray pose [52]. Progesterone (as various synthetic derivate are commonly used to manage PTB) was used as ligand molecule, downloaded from the PUBCHEM compound library (PubChem CID 5994).

Selection of drug molecules involved in PTB management
Drug used in PTB management were explored from the publicly accessible DrugBank database having comprehensive information about drugs and drug targets [53]. The search strategy included the key terms and synonyms, adapted to retrieve PTB-related drugs and therapeutics were "preterm birth OR preterm labor OR preterm delivery". The visual inspection of all compounds was performed to ensure the arrangements of respective constituting atoms as per 3D molecular structures. Furthermore, stable molecular conformations with arrangement in space was carried out by the energy minimization of 3D structures of PTB therapeutic compounds. According to 'drug type' category, only small molecules were filtered by Lipinski's rule of five [54]. After analysis, only those PTB drugs were selected which had PASS all five physicochemical properties of Lipinski's rule, as any single violation were treated as FAIL.

In-silico molecular mechanics and energetic estimations of PTB drugs interacted with CNN1
AutoDock Vina software (version 4.2) was used for protein-drug analysis and to estimate the molecular interactions of PTB proteins with selected drugs (ligand) molecules [55]. The best obtained cavity binding pattern (homology model with progesterone) of CB-Dock with lowest free energy and RMSD value were selected, and molecular interacting targets were further used as prominent residues in this cavity-guided AutoDock approach. Polar hydrogens were added, and partial charges were assigned to the standard residue using Gasteiger partial charge, which assumes that all hydrogen atoms were represented explicitly. The most favorable binding interactions were estimated through lowest predicted free energy of binding with best molecular docking simulation pose.

In-silico pharmacokinetics predictions of the screened PTB drugs
The ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) approach was used for in-silico pharmacokinetic predictions of the selected PTB drugs. ADMET lab platform (http://admet.scbdd.com/home/index/) [56] was used to access the ADMET properties. The assessment was carried out for each physiochemical property by submitting a SMILE format of the individual compound.

PTB-related genes and proteins
Overall, 55 proteins (isoforms inclusive) of 20 genes were found with PTB-related search query from NCBI duration from 1 st January, 2009 to 30 th June, 2021) ( Table 1). Out of 20 genes, 18 genes have had reviewed reference sequence status while the remaining two have validated status (SKA2 and CNN1). 9 genes were located on the reverse strands and 11 genes were located on the forward strand. The significant gene expression in endometrium and other female reproductive tissues were exhibited by 9 genes, including PAEP, SERPINH1, PLAG1, CNN1, IL6, ADAMTS4, SIGLEC6, PBX1, and KLRC1. PLAG1 gene has the highest transcription counts of 28. The total exon counts were also extracted, which indicated that COL24A1 gene has the highest numbers of exons (66) while IQGAP and SLIT2 have 41 and 40 exons, respectively. The highest numbers of isoforms were found in PLAG1, i.e., 20 isoforms ( Table 1).

nsSNPs of PTB genes
Through ENSEMBL gene ID of each gene, the genetic variations were extracted. ROBO1 gene has the highest numbers of SNPs, i.e., 278,967, while ADAMTS4 has very least numbers of SNPs, i.e., 5,757 (Fig 2). Likewise, the highest exonic SNPs in the identified PTB-related genes were observed in ROBO1 genes, i.e., 16,467 while the lowest exonic SNPs were found in IL4, i.e., 630. The exonic SNPs were also filtered with the help of VEP module. Total numbers of nsSNPs were observed with much reduced in numbers as compared to the previous exonic counts in each gene. The nsSNPs were highest in ROBO1 gene (13,235) and IL4 has the lowest numbers of nsSNPs, i.e., 467 (Fig 2).

Extraction of rare high risk nSNPs supported by transcript level computation of ALFA in South Asian population of PTB genes
All exonic variants of 20 PTB-related genes were further filtered based on MAF�0.01 and downstream functional effects (Fig 3). Based on transcript sequences, the output kept only 4 genes having 36 rare coding variants, that is, CNN1 with 4, COL24A1 with 1, IQGAP2 with 26 and SLIT2 with 5 nsSNPs (Fig 4). The various gene annotation statuses of all 4 genes having deleterious nsSNPs were shown in Table 2.

PLOS ONE
identified. It was observed that ALFA frequencies were available only for two genes i.e., COL24A1 and IQGAP2 (S1 Table). The total count of rare nsSNPs identified in previous steps (36) was reduced to 27 (S1 Table). 1 deleterious variant was identified from COL24A1 gene, and 26 deleterious variants were from IQGAP2 which have computed ALFA (Global and South Asian) sample size with frequency details of reference and alternate alleles (S1 Table).

In-silico transcript level functional impacts analysis of rare nsSNPs variants
We retrieved 36 transcript-level high-risk PTB-related nsSNPs in four genes, followed by further validation with six different algorithms to evaluate the functional impacts (Table 3). Missense variant rs532147352 of CNN1 found in 4 different transcripts have same amino acid variation (R>H) at four residue position according to their sequence length. PANTHER predicted all four transcript-level variations as 'probably damaging' with higher PhD-SNP probabilities. The disease associated PROVEAN, SNPs&GO, SNAP2, PMut and MutPred2 scores further classified these mutations as high-risk nsSNPs. In COL24A1, a single identified rs184448337, G>R had found associated with possibly damaging effects by PANTHER, with higher disease probabilities computed by PhD-SNP, PROVEAN, SNPs&GO, SNAP2, PMut and MutPred2 algorithms. In IQGAP2, rs140724393, Y>C with three different residue positions in 4 different transcripts according to the variability in sequence length were found associated with higher probabilities of pregnancy risks. Another SNP, i.e., rs200940839 have 5 different transcripts with four different residue-based variations (P>L/H) is regarded as higher disease-risk association probabilities. Third SNP rs34950321 exhibited in 6 different transcripts of the same substitution (T>I) of with 4 residue-level numbering had also found disease association. The last filtered variant of IQGAP2 was rs34968964 with six different transcripts. PATNHER predicted as probably damaging, while the rest of others were predicted as 'neutral' mutation, with least functional impact on protein   (Table 3). For SLIT2, the rs547207452 occurred in 5 different transcripts with five residue positions of G>R substitution computed as probably damaging and pathogenic in functional impacts except SNP&GO probabilities (Table 3).    Table 4). The ConSurf analysis predicted that the nsSNP of CNN1 gene (in all four transcripts) was highly conserved and exposed with highest score ranged between 8-9. The SOPMA secondary structural prediction characterized the residues position in random coil region of protein. The all four transcripts have highest proportion of coil (69-79%), with 43-64% of residues with few beta-strands and were exposed to surface (Table 5). This mutation causes the change of charge from positive to neutral with increase hydrophobicity of protein. The HOPE prediction showed that the mutant residue was smaller than the wild-type residue; this might lead to loss of interactions. Mutant residue was found near a highly conserved position, the mutation was located within a stretch of residues that was repeated in the protein as "Calponin-like 1". This mutation might disturb this repeat and consequently any function other this repeat performed. In COL24A1, the nsSNP rs184448337 (G>R) mutation also exhibited the decrease in stability of protein, as computed by the relevant software, which might affect the function and associated with disease occurrence (Fig 5 and Table 4). The SNP of COL24A1 possessed highest ConSurf score (9) located in random coil region of protein and were highly conserved and exposed ( Table 5). The alpha-helix and beta-strands were least in proportion as compared to coil which was 91.2% and 66.4% of residues were exposed to surface ( Table 5). The mutation results in the change of charge from neutral to positive which decrease the hydrophobicity. HOPE inference predicted that the mutant residue was bigger than the wild-type residue. The wild-type residue was more hydrophobic than the mutant residue. The mutation is found    within a domain, annotated in UniProt as "Collagen-like 11". The mutation introduces an amino acid with different properties, which can disturb this domain and abolish its function. The wild-type residue was a G, the most flexible of all residues. This flexibility might be necessary for the protein's function. Mutation of this G with R can obliterate this function. In IQGAP2 gene, the nsSNP rs140724393 resulted in the large destabilization of protein structure due to Y>C mutation, which might lead to severe pathophysiological consequences ( Fig 5 and Table 4). This mutation identified in all four transcripts was also found conserved, buried, and located in the extended strand of protein secondary structure. The higher secondary structural proportion was alpha-helix (68.9-76%) and around 59% of residues were exposed to surface (Table 5). HOPE inference predicted that the mutant residue is smaller than the wild-type residue. The mutant residue was more hydrophobic than the wild-type residue. This can result in loss of hydrogen bonds and/or disturb correct folding. The mutation is located within a domain annotated in UniProt as "Calponin-homology (CH)". The mutation introduces an amino acid with different properties which can disturb this domain and abolish its function. The mutated residue was in a domain that is important for binding of other molecules. Mutation of the residue might disturb this function. rs200940839 through various analytical tools was predicted as higher protein structural changes which could resulted in the destabilization of protein function (Fig 5 and Table 4). The mutation rs200940839 (P>H, L) identified in all five transcripts were found conserved, buried, and located in random coil, except (P1070H/L) in transcript ENST00000274364.11, which found exposed and moderately conserved. The higher proportion of secondary structure was alpha-helix (~69%) and around 59% of residues were exposed to surface ( Table 5). The mutant residue is bigger than the wild-type residue. The wild-type residue is more hydrophobic than the mutant residue which is proline. The mutation is located within a domain annotated in UniProt as "Ras-GAP". The mutation introduces amino acids with different properties which can disturb this domain and abolish its function. Prolines are known to be very rigid and therefore induce a special backbone conformation which might be required at this position. The mutation can disturb this special conformation. The mutated residue is located on the surface of a domain with unknown function. The mutation might cause loss of hydrophobic interactions with other molecules on the surface of protein.
The mutation rs34950321 (T>I) was predicted to cause higher protein structural changes which could resulted in the destabilization of protein function (Fig 5 and Table 4). This mutation was identified in six transcript that showed highly conserved, exposed area and present in the random coil region. The highest secondary structural proportion was alpha-helix (~68-88%) and~58% of residues were exposed to the surface (Table 5). According to the HOPE inference, the mutant residue was bigger than the wild-type residue and was more hydrophobic than the wild-type residue. The residue is buried in the core of a domain. The differences

PLOS ONE
between the wild-type and mutant residue might disturb the core structure of this domain. The hydrophobicity of the wild-type and mutant residue differs. The mutation will cause loss of hydrogen bonds in the core of the protein and as a result disturb correct folding. The mutation rs34968964 (E>Q) in all six transcripts were found highly conserved, exposed, and located in alpha-helix region of protein (Fig 5 and Table 4). Around 68-88% residues were alpha-helix with~59% of residues were exposed to surface ( Table 5). The HOPE inferred that the mutation converts the wild-type residue in a residue that does not prefer αhelices as a secondary structure. The charge was changed from negative to neutral and the mutated residue was located on the surface of a domain with unknown function. The residue was not found to be in contact with other domains of which the function is known within the used structure. However, contact with other molecules or domains will still possible but might be affected by this mutation. The charge of the wild-type residue was lost by this mutation and hydrophobicity increases. This can cause loss of interactions with other molecules. The protein stability tools also predicted the decrease in the stability of structure and function, as indicated by various energetic-based in silico estimation.
In SLIT2 gene, the mutation rs547207452 (G>R) in all five transcripts was not computed by ConSurf software, however, transcript IDs ENST00000273739.9, ENST00000503837.5, ENST00000622093.4 were in beta-turns and the rest two were with random coil region as this protein has >82% coil in secondary structure distribution. The~63% residues were exposed to surface, and mutation results in decreasing the hydrophobicity ( Table 5). The HOPE inference predicted that the mutant residue was bigger than the wild-type residue. The mutation was located within a domain annotated in UniProt as "Laminin G-like". The mutation introduces an amino acid with different properties which can disturb this domain and abolish its function. The wild-type residue is a G, the most flexible of all residues. This flexibility might be necessary for the protein's function. Mutation of this G can abolish this function which was also clear from the computation of various indicators to investigate the impact of mutation on protein function, leading towards disease progression.

Protein structural coverage identified genes expressed in female reproductive tissues
The protein-wise structural coverage of selected proteins was extracted from the PDB. The highest protein structural coverage of CNN1 was 41% with amino acids length of 136 residues positioned at 20-142 through NMR method (Table 6). However, IQGAP2 and SLIT2 have least coverage, 6-24% and 7-12%, respectively with existing templates. COL24A1 has no structural coverage with PDB templates (Table 6).

Homology modeling of CNN1 isoform 2
The secondary structure details of CNN1 isoform 2 showed that 29.60% (82 residues) formed alpha-helix, 14.80% (41 residues) were involved in beta-strand, 7.58% (21 residues) were involved in beta-turn and random coil were 48.01% (133 residues). CNN1 isoform 2 was found to be the highly expressed protein (in female reproductive tissues) with maximum structural coverage available on PDB database (Table 6). Therefore, the template selection for the homology modeling of CNN1 isoform 2 was carried by BLAST tool (Table 7). Top 5 alignment descriptions were used in which the highest similarity was seen with the solution structure of

PLOS ONE
CH domain of human calponin 1 isoform 1, PDB ID: 1WYP (43% query coverage, E value 2.00×10 −86 , and 94% identity) (Table 8). Afterwards, homology modeling was carried out by PHYRE2 the 3 rd ranked structure was identical with the BLAST high similarity results of CNN1 isoform 2 ( Table 8). The 3D stereochemical analysis and validation of 3D model of CNN1 was evaluated by the quality factor of ERRAT (73.68%) and 97 residues lied in the most favored region (S3 Table). Only 10 residues were in additional allowed region while no residue was found in the generously allowed or disallowed region. The Verify3D score was 83.74% which means that >80% residues have averaged 3D-1D score > = 0.2. Numbers of non-glycine and non-proline residues were 108 (glycine = 10 and proline = 2). The residual properties showed that maximum deviation observed was 2.6, the bond length/angle was 4.8, and the bond contact was 14. The G-factor analyses showed that dihedrals score was -0.01, covalent score was 0.69 with overall score of 0.26. Planar groups of 3D model showed that 100% were within the limit (S3 Table). The CNN1 isoform 2 exhibited a Calponin homology (CH) domain started at R10 and ended at L107. The low complexity region (LCR) of CNN1 was observed in two regions which are K114-K123 and A265-N275, respectively. Finally, we selected the highest percent similarity (94%) model of 123 aligned residues as our target protein for further investigation of this work (Table 8). For subsequent comparison with generated homology model through PHYRE2 tool, we downloaded the crystal structure of 1WYP from PDB database as template. The overall RMSD (root mean square deviation) value of homology model was 2.21 obtained through its superimposition with PDB 3D-template i.e., 1WYP. However, RMSD value for α-carbons was 2.27, back-bone atoms were 2.22 and heavy atoms were 2.35.

Prediction and validation of binding cavities by consensus mode
The binding cavities were validated CB-dock approach. The highest binding affinity (docking score) was examined through the in-silico interaction of homology model (CNN1 Isoform 2) with progesterone in cavity-graded pose rank 1 (-6.5 kcal/mole) (S4 Table). The detailed analysis of amino acids interaction was explored through 3D and 2D visualization of cavities interaction which reveals the interaction of R10, D84, S102, L105, K123 and Y124 (S4 Table).

Selection and screening of drug molecules involved in PTB management
From DrugBank database, initially 17 drugs were extracted as small drug molecules (S5 Table). All 17 drugs were screened in relevance with the drug targets information and their mode of action, ultimately reduced to 7 PTB-related drugs. The Lipinski's rule of five was then applied on all 7 screened drug molecules, 5 drugs fulfilled the completely criteria (S6 Table).

In-Silico molecular mechanics and energetic estimations of filtered PTBrelated drugs interacted with CNN1 isoform 2
The identified residues through CB dock approach were selected as key interacting residues and further used for in-silico analyses of all five drugs interactions with homology model (CNN1). The estimated energies and molecular mechanics of all five drugs showed a significant binding interaction with homology model (Table 9 and Fig 6). The highest binding energy was estimated with Retosiban (-9.43 kcal/mol), followed by Hydroxyprogesterone caproate (-8.19 kcal/mol). Allylestrenol and Ritodrine were also found with high binding energy, -7.56 and -7.39 kcal/mol, respectively. The overall in-silico pharmacokinetics predictions of 5 screened PTB-related drugs showed safest therapeutic properties with regards to the basic physicochemical characteristics of absorption, distribution, metabolism, excretion and toxicity indices (S7 Table).

Discussion
In this study we investigated the functional impact of 36 rare coding variants/rare-high risk nsSNPs through in-silico machine learning algorithms including Panther-PSEP, PhD-SNP, PROVEAN, SNP&GO, SNAP2, PMut and MutPred2 tools. Present findings through Panther-PSEP showed all 6 nsSNPs from CNN1, IQGAP2, and SLIT2 with the highest probability of deleteriousness, which may be related, as the longer a position has been preserved, the more likely that it will have a deleterious effect [57]. Interestingly, our results from PhD-SNP, SNPs&GO, PROVEAN, SNAP2, PMut, and MutPred2 showed all transcripts nsSNPs as pathogenic variants, which shows that these residue-based substitutions are associated with damaging effects on the biological function of PTB protein. However, there was some variation in a few transcript scores denoting lesser pathogenicity of mutations.

PLOS ONE
A genome-wide association study (GWAS) on Finnish population reported the occurrence of fetal SLIT2 genomic variation [58]. SLIT2 and ROBO1 gene expression in the cells of placenta and trophoblast may be linked with spontaneous PTB [58]. In this manner, variant annotations with elaborative approach regarding their role in PTB are needed to interpret the impact of point mutations which can lead to protein structural and functional changes, and may set the basis for the occurrence of APOs [59]. Our findings about point mutations that may influence protein stability can be useful in drug designing and prevention of PTB.
Predicting the effect of genomic variations on protein stability such as amino acid substitution is useful for screening of disease-causing SNPs [36] in proteins. Our findings reported the transcript-based computation of free energy changes of all seven rare nsSNPs of PTB-related genes having a decrement in protein stability, except the computed score of iPTREE-STAB of COL24A1 gene (rs184448337). This was further validated through ConSurf, SOPMA and HOPE tools which investigated these mutations, as they were highly conserved from the point of evolution.
CNN1, thin filament-associated protein known to regulate smooth muscle contraction, includes regulation, modulation of contraction processes, and activation of downstream pathways. It is reported to bind actin, calmodulin and tropomyosin [60,61]. The regulatory roles in both cytoskeletal dynamics and signaling of CNN1 are due to its CH (calponin homology) domain [61]. The role of CH domain in the onset of some diseases like cancers and asthma were reported [61]. The 3D protein model that we generated in this work, namely CNN1, has been reported as a biomarker for the prediction of preterm birth among pregnancies complicated by threatened preterm labor, in addition to cervical length measurement [62,63]. However, its protein functions and corresponding roles in diseases still needs to be explored, and this lack of literature will potentially link its contributing pathophysiological role in the onset of PTB. The identification of rare exonic variants as genetic markers for any disease with high relevance is a significant contributor for reporting the genetically associated risk of diseases [21]. All the four identified rare pathogenic variants of the CNN1 gene have pertinent amino acid substitution/variation observed beyond the reported structural coverage. This has been considered as an obstructing limitation of this work, which narrows the generalizability of the impact of the present results.
The current pharmacological basis for managing PTB with progesterone is well reported [64]. Unfortunately, the availability of information regarding the molecular or genomic basis of drug interactions to manage PTB (associated with multi-etiological factors) is not well documented and sets the stage for this research. Investigation of key active sites or interacting amino acids as potential targets (corresponding protein cavities) and their binding interactions with intervening bioactive compounds helps to explore PTB management challenges.
Drug target identification describes the process of identifying the direct molecular target available as protein or on nucleic acid, prominent for continuing interaction with usually small drug molecule [65]. An effective drug target needs to be therapeutically relevant to the disease phenotype and will not cause side effects by disrupting the homeostasis (physiological) mechanism/function of the target protein [65]. Hence, in this study the identification of potential druggable protein binding cavities showed the amino acids interaction (R10, D84, Q99, S102, L105, K123, Y124) with highest docking affinity, identified, confirmed through CB-dock [52] and COACH meta-server [53]. This strategy helps in the identification of PTB biomarkerrelated drugs interactions site present on our candidate gene target [65].
We selected five main drugs, including Hydroxyprogesterone caproate, Allylestrenol, Terbutaline, Ritodrine, and Retosiban with potential drug-likeness attributes, and performed residue specific molecular docking with our PTB target protein (CNN1 isoform 2). The obtained energetic estimate plays decisive role regarding the drug designing approach based on the binding interactions of PTB protein with therapeutic compound to identify protein targets through structural conformation held with non-covalent interactions [66]. The association of selected drugs used to manage PTB with the CNN1 isform2 protein 3D model showed an inhibitory impact on this target gene involved in PTB. The highest binding energetic estimation was observed in Hydroxyprogesterone caproate and Allylestrenol with increased attractive charge interaction at the D7 and D84 positions of Hydroxyprogesterone caproate. The commonest residues of CNN1 isoform 2 observed in the different kinds of chemical interactions of all five drugs were S102, L105, A106, K123, and Y124. The observed Pi-sigma interactions (Pialkyl and alkyl) may be due to large charge transfer involvement which helps in the intercalation of the binding of drugs with prominent amino acids of CNN1 isoform 2 [67]. The highest numbers of hydrogen bonds were formed in Ritodrine (Q4, S102, G121, K123) and Terbutaline (D84, S102, N119, G121, K123), respectively. However, Retosiban (K123) and Allylestrenol (A3) showed a single hydrogen bond interaction. This exhibits the strength of PTB protein-drug binding complex and its interaction with the possible therapeutic inhibitory action as it represents the non-covalent force of interaction (formed between various positively charged hydrogen atoms and the existing electron negative atom; nitrogen or oxygen-lone pairs of electrons) of the drug entity [68].
Drug discovery and drug development processes are preliminary linked with an in-silico investigation approach in assessing the pharmacokinetics (ADMET; absorption, distribution, metabolism, excretion, and toxicology) properties of the drug of interest. The in-silico screening markedly decreased the rate of attrition before subjecting the drug to clinical trials [56,69].
The investigated physicochemical properties of Terbutaline, Ritodrine, and Retosiban showed high drug solubility properties, while Allylestrenol and Hydroxyprogesterone caproate were less soluble with elevated distribution coefficient values. The consideration of drug solubility plays a significant role in the selection of drug molecule, and its evaluation as best possible usage [56,69]. Besides the molecular weight, the concept of hydrophobicity may affect the degree or extent of solubilization and drug interactions with compounds like bile acids, which are facial amphipathic as they contain both lipid soluble (hydrophobic) and hydrophilic (polar) faces, and are also critical for transport and absorption of fat-soluble compounds [69]. Similarly, high hydrophobicity was assessed in Hydroxyprogesterone caproate and Allylestrenol as the role of molecular hydrophobic interactions (the tendency of interaction of nonpolar molecules in body fluids) cannot be ignored. The hydrophobic interaction is one of the major interacting molecular forces resulting from the diffusion of drug molecules through the cell (plasma) membrane and binds to the internally localized receptor for gene expression or target action [56,69,70]. This might serve as the basis to understand the binding of Hydroxyprogesterone caproate and Allylestrenol (both showed high docking/binding scores) with our CNN1 isoform 2.
After administration of the drug, the main focus of interest in drug discovery is the availability strength of drug into the bloodstream and then how fast the required magnitude of drug reaches its specific target site (receptor) for the purpose of therapeutic action [56,69]. The Caco-2 (the human colon epithelial cancer cell line) permeability assay can be used to predict in vivo absorption of drugs, as it measures the quantification rate of a drug across polarized Caco-2 mono-layers [56,69]. The in-silico outcome for Caco-2 permeability regarding human intestinal absorption of selected PTB-drugs showed the optimal range, further validated through HIA (human intestinal absorption) pharmacokinetic percentage scores [56,69].
Bioavailability represents the administered drug fraction which reaches the systematic (bloodstream) circulation as a sub-category of drug absorption variable/factor [69]. Hydrophobic drugs are considered as the most favourable class of drug choice as they alter the plasma membrane fluidity to increase the passive trans-cellular drug permeation [70,71]. This tendency also modulates the tight junction across the cell membrane to allow increased paracellular diffusion of drug molecules [56,69]. The in-silico pharmacokinetics for the bioavailability (F 30%) of five PTB drugs showed the highest values for Ritodrine (64.30%), Allylestrenol (48.20%), and Retosiban (44.50%). Terbutaline (26.20%) and hydroxyprogesterone caproate (19.30%) had lower range values. This generally strengthens our point of selection of small molecules and should be replicated in future novel drug compounds for finding best therapeutic role in relation for the management of PTB.
The investigational outcome regarding volume distribution of all five screened PTB-related drugs showed an evenly distributed pattern, and the drug Ritrodine is highly lipophilic in nature, among the rest. Except for Terbutaline, the plasma protein binding interaction of all four other drugs was satisfactory as it indicates the high binding affinity with CNN1 isoform 2, also cleared from the in-silico drug binding energetic estimates. When a drug enters the systematic (bloodstream) circulation, it is distributed to the different tissues of the body. The rate of drug entry into the tissues mainly depends on the rate of blood flow to the respective tissues and the various partition characteristics between blood and tissue. Drugs reach the CNS (central nervous system) through brain capillaries and CSF (cerebrospinal fluid), but the restriction of drug penetration to this vicinity is mainly regulated by the brain's permeability characteristics [56,69]. Allylestrenol and Hydroxyprogesterone caproate showed the highest blood brain barrier (BBB+++) indications. This may be due to the rate of drug penetration into CSF and its further categorization by the extent of protein binding, degree of drug-ionization, and the lipid-water partition coefficient of the drug molecules [69]. The significance of the slow penetration rate of drugs in the brain is mainly associated with high protein binding and this is again validated in the case of allylestrenol and hydroxyprogesterone caproate.
When any drug or pharmacological compound is administered to the body by mouth, the first pass metabolism occurs in the liver tissues. This phenomenon typically occurs when a drug gets metabolized at a particular location in the body that ultimately results in the reduced active concentration of the drug upon reaching its target site of action or the systemic circulation. This pre-systemic metabolism may greatly reduce the bioavailability of the drug and serve as the basic justification for most of the drug's withdrawal from the market, in addition to the reported drug-induced liver injury. Similarly, cytochrome P450 (CYP-450) serves as a super-family of proteins containing heme as a cofactor, which constitutes its significant role in metabolism and detoxification of xenobiotic [56,69]. During the drug metabolism process, the quantity of drug required for effective concentration to exhibit therapeutic action should not be quickly metabolized by CYP-450. Hence, the CYP-450 enzyme is fundamentally used for understanding the metabolism of drugs through oxidizing substances by Fe (iron) and can metabolize a large variety of xenobiotic substances. CYP-450 can be inhibited or induced by the administration of certain drugs, which may lead to drug-drug interactions with unexpected adverse reactions or potential therapeutic failures [69]. Therefore, the substantial information about the drugs' metabolism based on CYP-450 (the most potent inhibiting and inducing drugs) assessment can be beneficial in reducing the possible drug's adverse reactions and interactions. Our present predicted pharmacokinetics findings related to selected PTB-drug metabolism showed the significant impact of all five therapeutic compounds.
The half-life time and clearance rate of any drug play a vital role in the drug development process as it involves the removal of drug/substance from the body either as bio-transformed metabolites or in unchanged form (intact drug form). This whole process is supported by a variety of factors which impact drug excretion, such as intrinsic drug properties (polarity, pH, and molecular mass) and may induce kidney tissue injuries [56,69]. Moreover, LD50 is an important measure of drug toxicity; a drug with a small LD50 value is more toxic than a drug with a high LD50 value [56,69]. The predicted pharmacokinetics results indicated the satisfactory value range for all five screened PTB-drugs.
In future, a follow up study may look for rationalization and optimization of PTB protein/ biomarkers (obtained from the present effort) through in-vitro challenge by using variable levels of perturbing variants of PTB. Furthermore, this study will lay the groundwork for future research into the gene-associated quest to better understand the role of CNN1 protein, particularly its CH domain, specifically probing its relevance with PTB via functional genomewide sequencing and detailed molecular analysis. Finally, the role of identified consensus interacting residues of this in-silico experimentation will be further tested with novel derivatives, which will pave the way for discovering new therapeutic compounds to manage the challenge of PTB.

Conclusion
PTB is an important perinatal health issue and addressing this global health problem through in-silico genes-based strategies not only developed a better understanding to improve access to cost effective obstetric and neonatal care but also reduce the neonatal and childhood mortality. This research work comprising the in-silico transcript-based analyses of nsSNPs of CNN1, COL24A1, IQGAP2 and SLIT2 genes highlighted that 7 nsSNPs with 36 different transcripts have shown the most pathogenic among all known and reported nsSNPs. These nsSNPs possess marked energetic differences in terms of their impact on the stability of protein. The structural docking-based experimentation of CNN1 gene and its molecular interaction (S102, L105, A106, K123, Y124) analysis could serve as an intervention target for the prevention of PTB. Therefore, on a country level, if more efforts are made in terms of exploring novel genetic variations to identify non-synonymous pathogenicity from the same set of genes with further development and incorporation of in-silico tools will pave the way for managing PTB.
Supporting information S1 Table. ALFA frequencies of rare nsSNPs of PTB-related genes.